<<<<<<< HEAD

3DS : Cherry Blossom

Lyuda Bekwinknoll, Meghana Cyanam, Theresa Marie Duenas, Kevin Kiser

With our data visualization we are determining the association between age and fitness based on running data from the Cherry Blossom Ten-mile Run held in Washington DC from 1973 to 2020.

Loading and Cleaning Data

<<<<<<< HEAD
# load packages
library(dplyr)
library(ggplot2)
library(chron)
library(plotly)
library(purrr)
library(RColorBrewer)
library(readr)
cb_10M_1973_2005_weather <- read_csv("cb_10M_1973_2005_weather.csv", show_col_types = FALSE)

cb_10M_2006_2019_weather <- read_csv("cb_10M_2006_2019_weather.csv", show_col_types = FALSE)

clean_data <- function(df) {
  df <- dplyr::select(df, Year, Name, Age, 
               Time, Division, pos_by_sex, 
               total_by_sex, Sex, PRCP, TMAX, TMIN)

  df$Age[df$Age == 'NR'] <- 0
  df$Time[df$Time == 'NR'] <- 0

df <- df %>%
    mutate(
      Year = as.integer(Year),
      Name = as.character(Name),
      Age = as.integer(Age, na.rm = TRUE),
      Time = as.character(Time),
      Division = as.character(Division),
      pos_by_sex = as.integer(pos_by_sex),
      total_by_sex = as.integer(total_by_sex),
      Sex = as.character(Sex),
      PRCP = as.numeric(PRCP),
      TMAX = as.integer(TMAX),
      TMIN = as.integer(TMIN)
    )
  
  return(df)
}

df1 <- clean_data(cb_10M_1973_2005_weather)
df2 <- clean_data(cb_10M_2006_2019_weather)


df <- bind_rows(df1, df2)
# fixes times with mins recorded as hrs (mm:ss:00 -> 00:mm:ss)
df <- df %>%
  mutate(
    Time = ifelse(
      Time >= "43:00:00" & Time <= "59:59:59",
      paste0("00:", substr(Time, 1, 2), ":", substr(Time, 4, 5)),
      Time
    )
  )


df <- df %>%
  filter(!is.na(Time) & Time != "" & Time != "00:00:00") %>%
  mutate(Time = chron::times(Time)) %>%
  filter(!is.na(Time)) %>%
  filter(Time >= "00:43:00" & Time <= "02:20:00") %>%
  filter(Age >= 8 & Age <= 87) %>%
  filter(Year != 1977) %>%
  filter(Year != 1973) %>%
  mutate(
    # replaces 'W' with 'F' in sex col
    Sex = ifelse(Sex == 'W', 'F', Sex),
    # correction for missing values by first character in division
    Sex = ifelse(substr(Division, 1, 1) == 'W', 'F', Sex),
    Sex = ifelse(substr(Division, 1, 1) == 'M', 'M', Sex)
  )
# add final data to git repository
# write.csv(df, "cleaned_10M_1973_2019.csv", row.names = FALSE)

unique(df$Division)
##  [1] "M3034" "M0119" "M2024" "M2529" "M3539" "M4549" "M4044" "M5559" "M5054"
## [10] "M6064" "M6569" "M7074" "W2529" "W0119" "W3034" "W2024" "W5054" "W3539"
## [19] "W4549" "W4044" "M7579" "W5559" "W6064" "M8099" "W6569" "W7074" "W7579"
## [28] "W8099"
======= >>>>>>> 70cad8d358d80d714587be8d5a55e1cf6830e8d8

Describing our Data

<<<<<<< HEAD ======= >>>>>>> 70cad8d358d80d714587be8d5a55e1cf6830e8d8
Variable Names Data Type Variable Descriptions
Year Integer Year the race was held.
Name Character

An individual’s first and last name with varying formats. Most of the CUCB website results for names also list an 'M', 'F', or 'W' in parenthesis for the individual's sex.

example: James Yenckel (M)

Age Integer Age of runner at time of race.
Time Time/Numeric Time in hr:min:sec format to run 10 miles. This is how long it took each runner to complete the race.
Division Character

28 different divisions are contained, 14 in each sex. They range from 4 of them having 20 year ranges, while the rest have 5 year ranges. Each division is an alphanumeric code separating competitors by sex and age. The example shows 25 to 29-year-old women.

example: W2529

Groupings based on age and gender.
pos_by_sex Integer Shows the place that a runner finished by sex per year.
total_by_sex Integer The total number of competitors overall for a sex per year.
Sex Character Gender of runner.
PRCP Numeric Precipitation recorded as daily rainfall in inches to one decimal place collected by NOAA.
TMAX Integer Minimum daily temperature recorded in Fahrenheit, collected by NOAA.
TMIN Integer Maximum daily temperature recorded in Fahrenheit, collected by NOAA

Dataset Overview:

In the original data set we have 347402 rows and 17 columns. After cleaning the data set we ended up with 339934 rows and 11 columns. 7468 rows of data were omitted from the data we used because they had missing values for the time and/or age variables. Below is the description of the variables and data we excluded for our data analysis/visualization:

What was excluded Reason for exclusion
Hometown Many missing values and inconsistencies were found in the data entries. We found a few individuals reporting their hometown differently each time they ran the race or just reporting several at once. Due to this we decided to remove this variable from our analysis because there are no accurate conclusions that can be drawn. Also this is not a variable we could use to fulfill our main objective, so we chose to exclude it from our analysis.
Distance The data in this column was describing the race of this length which is 10 miles. Since we already know it is the data from the 10 mile race, having a column that explicitly states that for row of our data is redundant.
Date We decided to exclude this variable since we know that the race happens at a certain time each year during spring, and having the specific dates would not impact our data question in any way.
pos_by_div This variable gives us the position that a runner finished in their assigned division for a certain year. There are 28 different divisions, 14 per sex. Each of these divisions includes a age range which differs from 20 years to 5 years. We decided to exclude this variable since the same information could be obtained from the pos_by_sex variable and we could instill our own age ranges.
total_by_division This variable gives the total number of individuals in each division for a certain year. The divisions are the same as described above and are excluded for the same reason as above.
Pace The Pace gave the pace per mile of each runner for the race. We decided to exclude this from our analysis because of the fact it wasn’t reading in correctly. Also the pace can be calculated directly from the Time variable by dividing it by 10 (the total miles in the race). Therefore we decided to remove this column of data from our data frame.
Data from the year of 1977 We decided to remove the data from the year of 1977 due to the fact that there was a large chunk of data missing from the times right in the middle of the race time. We are not given information about what happened in that period that resulted in such record, so we don’t have any background about that. Also if we keep this year in our data analysis, is has a potential to make our data analysis biased since there are a lot of points missing from a main part of the times, which would lead to inaccuracy in our interpretations. That is why we have decided to exclude the year from our data.

Summary Statistics:

Year, Age, Time, Sex main variables to focus on.

Checklist for this section:

summary stats: mean, median, mode, range, sd, percentiles, distributions by sex variable, etc.

mention how many women and how many men in each year and overall

summary.data.frame(df)
##       Year          Name                Age            Time         
##  Min.   :1974   Length:339214      Min.   : 8.0   Min.   :00:43:20  
##  1st Qu.:2001   Class :character   1st Qu.:29.0   1st Qu.:01:19:35  
##  Median :2009   Mode  :character   Median :35.0   Median :01:30:50  
##  Mean   :2006                      Mean   :36.6   Mean   :01:31:25  
##  3rd Qu.:2015                      3rd Qu.:43.0   3rd Qu.:01:42:22  
##  Max.   :2019                      Max.   :87.0   Max.   :02:20:00  
##                                                                     
##    Division           pos_by_sex     total_by_sex       Sex           
##  Length:339214      Min.   :    1   Min.   :   27   Length:339214     
##  Class :character   1st Qu.: 1109   1st Qu.: 3513   Class :character  
##  Mode  :character   Median : 2445   Median : 6792   Mode  :character  
##                     Mean   : 3134   Mean   : 6298                     
##                     3rd Qu.: 4739   3rd Qu.: 9030                     
##                     Max.   :11042   Max.   :11042                     
##                     NA's   :6       NA's   :6                         
##       PRCP             TMAX           TMIN      
##  Min.   :0.0000   Min.   :44.0   Min.   :32.00  
##  1st Qu.:0.0000   1st Qu.:56.0   1st Qu.:39.00  
##  Median :0.0000   Median :64.0   Median :43.00  
##  Mean   :0.0538   Mean   :63.3   Mean   :43.11  
##  3rd Qu.:0.0500   3rd Qu.:70.0   3rd Qu.:47.00  
##  Max.   :0.9300   Max.   :84.0   Max.   :58.00  
## 

These were helping me evaluate the data cleaning, we can fix or replace them later

<<<<<<< HEAD
x <- df$Time
fit <- density(x)

custom_ticks <- c("00:42", "01:13", "01:45", "02:17", "02:49")
tick_positions <- seq(min(x), max(x), length.out = length(custom_ticks))
tick_labels <- as.numeric(tick_positions)

plot_ly(x = x, type = "histogram", name = "Histogram") %>% 
  add_trace(x = fit$x, y = fit$y, type = "scatter", mode = "lines", fill = "tozeroy", yaxis = "y2", name = "Density Curve") %>% 
  layout(title = "Distribution and Density Curve \n All Years",
         xaxis = list(title = "Time (hr:min)", showline = TRUE, 
                      tickvals = tick_positions, ticktext = custom_ticks),         
         yaxis2 = list(title = "Density", showline = TRUE, overlaying = "y"),
         yaxis = list(showticklabels = FALSE, side = "left"))